METHOD AND SYSTEM FOR AGGREGATING DATA DISTRIBUTION MODELS

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to a method and system for aggregating data distribution models. More specifically, the invention provides a system and method of aggregating or combining data distribution models while correctly maintaining the applicability of statistical and analytical techniques.

2. Description of the Prior Art and Related Information

When a researcher or engineer analyzes a large quantity of data, called a data distribution, some summarization of the data elements in the data distribution is generally necessary because of the limitations of existing hardware and software. Generally, a summarization of the distribution's center and spread, including the mean, sigma (standard deviation), and the number of data elements in the data distribution, is used. Alternatively, sampling may be used to extract a smaller, more manageable subset of data that can be analyzed.

Unfortunately, summarizing the data elements with a mean and sigma assumes that the distribution is Gaussian, or normal, and represents a single distribution and is not a mixture of independent distributions. Similarly, sampling may miss important elements of the distribution, for example outliers or bimodal patterns, unless the sample is sufficiently large.

Thus, there is a need for a system for advanced analysis that provides the benefit of maintaining the overall shape and characteristics of the data distribution while keeping the data storage requirements to a minimum. There is a further need for a system that can perform aggregation of small subgroups of the data distribution, thus keeping computation needs to a minimum. There is a further need for such a system with which statistical tests can be properly performed without making assumptions about the data distribution. There is a further need for a system with which complex analytics can be performed, ranging from basic statistical functions, such as mean, minimum, maximum, and standard deviation, to complicated correlation and modeling studies. There is a further need for a system that naturally weights the highest data concentrations with the greatest accuracy in the approximation, wherein outliers are de-emphasized but not removed. There is a further need for a system with which an approximation of the original data distribution can be rebuilt from the model and estimates of the errors in this rebuilding can be made.

SUMMARY OF THE INVENTION

A system for creating an aggregated data model from a plurality of data distribution models is disclosed. Each data distribution model is a summarized version of a data distribution having one or more data elements. Each data element has a value. Each data distribution model has one or more bins, wherein each bin approximates a subset of the data elements. Each bin comprises a start point having a value, an end point having a value, and a polynomial formula that approximates the data elements for the bin. Each data distribution model thus comprises a summarized representation of a data distribution, wherein the aggregated data model represents a combination of two or more of the data distribution models.

The system includes a processor and a computer program that is executable on the processor.

The computer program is adapted to perform a plurality of steps in a method for creating the aggregated data model. The computer program may contain a plurality of modules for performing the steps. One step comprises determining which start point has the minimum value and which end point has the maximum value of all of the bins of all of the data distribution models. The next step performed is setting a start point of a first bin of the aggregated data model to said start point determined to have the minimum value. The next step is setting an end point of a last bin of the aggregated data model to said end point determined to have the maximum value. The next step comprises determining a total number of points for the aggregated data model by adding the values indicating the number of data elements from all bins from all data distribution models, each point comprising an approximated value of a data element from one of the data distribution models. The next step comprises approximating the data elements in the data distribution described by each data distribution model using the start point, polynomial formula, and number of data elements for each bin in each respective data distribution model, each approximated data element comprising one point in the aggregated data model. The next step is to sort the points from minimum to maximum. The next step comprises distributing the points into one or more bins in the aggregated data model such that a substantially equal number of points are in each bin of the aggregated data model. The end point of each bin in the aggregated data model may then be determined. The next step comprises determining a polynomial formula for the points for each bin of the aggregated data model.

The computer program may create the bins of the aggregated data model on a bin by bin basis. In that case the steps of approximating the data elements for the points of each bin, determining the end point for each bin, and determining the polynomial formula for each bin are performed for each bin individually.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram illustrating the major components of a system for creating an aggregated data model from a plurality of data distribution models;

Fig. 2 is a flow diagram illustrating steps that may be performed by the system of Fig. 1;

Fig. 3 is a flow diagram illustrating steps that may be performed by the system of Fig. 1 for creating each of the plurality of data distribution models that are aggregated by the system of Fig. 1;

Fig. 4 is a flow diagram illustrating the steps performed by the computer program for determining the start points and end points of the bins for each data distribution model according to the method of Fig. 3 and system of Fig. 1;

Fig. 5 is a graphic illustration of two data distributions represented as a histogram;

Fig. 6 is a graphic illustration of the data elements from one of the data distributions of Fig. 5 divided into bins;

Fig. 7 is a graphic illustration of an approximation of the original data distribution from Fig. 6 using the quadratic and spline fit versus a linear fit;

Fig. 8 is a graphic illustration showing the approximation error if a data distribution of Fig. 1 is treated as a normal distribution versus if the distribution is treated as a non-normal distribution using the system of Fig. 1; and

Fig. 9 is a graphic illustration of an aggregated data model aggregated from data distribution models of the data distributions of Fig. 5.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

With reference to Fig. 1, a system for creating an aggregated data model 100 from a plurality of data distribution models 102 is shown. Each data distribution model 102 is a summarized version of a data distribution 58 having one or more data elements 56, each data element 56 having a value, each data distribution model 102 having one or more bins 80 for approximating a subset of the data elements, each bin comprising a start point having a value, an end point having a value, and a polynomial formula approximating the data elements for the respective bin. Each data distribution model 102 thus comprises a summarized representation of a data distribution 58, wherein the aggregated data model 100 represents a combination of two or more of the data distribution models 102.

The system includes a processor 51 and a computer program 54 that is executable on the processor 51.

With reference to Fig. 2, the computer program 54 is adapted to perform a plurality of steps in a method for creating the aggregated data model 100. The computer program 54 may contain a plurality of modules 59 for performing the steps. One step comprises determining which start point has the minimum value and which end point has the maximum value of all of the bins 80 of all of the data distribution models 102, step 110. The next step performed is setting a start point of a first bin of the aggregated data model (a first of the bins 180 in Fig. 9, described below) to said start point determined to have the minimum value, step 112. The next step is setting an end point of a last bin 180 of the aggregated data model to said end point determined to have the maximum value, step 114. The next step comprises determining a total number of points for the aggregated data model by adding the values indicating the number of data elements 56 from all bins from all data distribution models 102, each point comprising an approximated value of a data element 56 from one of the data distribution models 102, step 116. The next step comprises approximating the data elements 56 in the data distribution 58 described by each data distribution model 102 using the start point, polynomial formula, and number of data elements 56 for each bin 80 in each respective data distribution model 102, each approximated data element 56 comprising one point in the aggregated data model 100, step 118. The next step is to sort the points from minimum to maximum, step 120. The next step comprises distributing the points into one or more bins (180 in Fig. 9) in the aggregated data model 100 such that a substantially equal number of points are in each bin 180 of the aggregated data model 100, step 122. The end point of each bin 180 in the aggregated data model 100 may then be determined, step 124. The next step comprises determining a polynomial formula for the points for each bin 180 of the aggregated data model 100, step 126.
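The sequence of steps 110 through 124 can be illustrated with a minimal sketch. It is not the disclosed program: the names `Bin`, `rebuild_points`, and `aggregate` are hypothetical, each bin is assumed to use a linear formula with its points spread evenly over the interval, every bin is assumed to hold at least one element, and any remainder is simply given to the first bins rather than placed by the formula given in the text.

```python
from typing import List, NamedTuple


class Bin(NamedTuple):
    start: float  # start point of the bin
    end: float    # end point of the bin
    count: int    # number of data elements summarized by the bin


def rebuild_points(model: List[Bin]) -> List[float]:
    """Step 118: approximate the data elements of one model.

    Assumption: a linear formula per bin, with the points spread
    evenly over the interval (start, end].
    """
    points = []
    for b in model:
        step = (b.end - b.start) / b.count
        points.extend(b.start + step * j for j in range(1, b.count + 1))
    return points


def aggregate(models: List[List[Bin]], n_bins: int) -> List[Bin]:
    """Combine several models into one model with n_bins bins."""
    all_bins = [b for m in models for b in m]
    lo = min(b.start for b in all_bins)       # steps 110 and 112
    hi = max(b.end for b in all_bins)         # steps 110 and 114
    total = sum(b.count for b in all_bins)    # step 116
    # Steps 118 and 120: rebuild the points of every model, then sort.
    points = sorted(p for m in models for p in rebuild_points(m))
    base, rem = divmod(total, n_bins)         # step 122
    result, start, used = [], lo, 0
    for i in range(n_bins):
        # Simplification: the first `rem` bins take the extra points.
        count = base + (1 if i < rem else 0)
        used += count
        # Step 124: the end point is the value of the last point in the bin.
        end = hi if i == n_bins - 1 else points[used - 1]
        result.append(Bin(start, end, count))
        start = end
    return result


model_m = [Bin(0.0, 10.0, 5), Bin(10.0, 20.0, 5)]
model_w = [Bin(5.0, 25.0, 10)]
combined = aggregate([model_m, model_w], 4)
```

With a linear formula, the start point and end point of each aggregated bin already define the formula of step 126, so no separate fitting step appears in the sketch.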

The computer program may create the bins 180 of the aggregated data model 100 on a bin by bin basis. In that case the steps of approximating the data elements 56 for the points of each bin 180, determining the end point for each bin 180, and determining the polynomial formula for each bin are performed for each bin 180 individually. The data elements 56 corresponding to the bins 80 of the data distribution models 102 are thus approximated as needed using the respective polynomial formula for the bin 80 of the respective data distribution model 102 in which the needed data elements 56 are contained. This technique tends to conserve resources for the processor 51 because once each bin 180 in the aggregated data model 100 is created, and the polynomial formula for that bin 180 is determined, the approximated data elements, or points, for that particular bin 180 may be discarded before processing the next successive bin 180 for the aggregated data model.

The step of determining the polynomial formula may comprise finding a quadratic formula having the best fit with the points by using mathematical techniques such as the least squares method. In the simplest case, the start and end points of each bin 180 may be fit into a linear formula by calculating the slope between the start and end point, and the y intercept, which is equal to the start point of the bin. Finding a polynomial formula is the preferred method. The term polynomial formula as used herein may include a linear formula, quadratic formula or other higher order polynomial formulas.
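As an illustration of the least squares method mentioned above, the following sketch fits a quadratic to the points of one bin by solving the normal equations with Cramer's rule. The function name `fit_quadratic` is hypothetical, and the choice of x as the position of a point within the bin and y as its value is an assumption; the text does not fix the independent variable.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c).

    Assumption: xs are positions of the points within a bin and ys are
    their values. At least three distinct xs are required.
    """
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))

    # Normal equations of the least squares problem, in matrix form.
    matrix = [[sx4, sx3, sx2],
              [sx3, sx2, sx],
              [sx2, sx, n]]
    rhs = [sx2y, sxy, sy]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(matrix)
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


# Points that lie exactly on y = 2x^2 + 3x + 1 recover those coefficients.
a, b, c = fit_quadratic([0, 1, 2, 3, 4], [1, 6, 15, 28, 45])
```

Cramer's rule keeps the sketch dependency-free; a numerical library routine would be the more usual choice for larger or ill-conditioned problems.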

The step of distributing the points between, or into, the bins 180 in the aggregated data model 100 may comprise dividing the number of total points in the aggregated data model 100 by the number of bins in the aggregated data model 100. If the number of points in the aggregated data model 100 is not equally divisible by the number of bins 180, then the number of points in each bin 180 is determined by dividing the number of points by the number of bins 180, and then adding one to the count of the points in each of a number of bins 180 equal to the remainder after dividing, wherein the bins 180 that have one added to the count are determined according to the following formula:

    for k from 1 to r
        bin_add = INT((n*k)/(r+1))
    next k

wherein bin_add is the sequential bin number to add one to the count of points to include therein, n is the total number of bins 180 in the aggregated data model 100, r is the remainder from dividing the number of points in the aggregated data model 100 by the number of bins 180, and INT is a function for rounding the result of the bracketed formula to produce an integer result.
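A short worked sketch of the formula follows. The function names are hypothetical, and INT is assumed to truncate (floor for positive values), as in BASIC-style languages; the text says only that it rounds to an integer.

```python
def bins_receiving_extra(n_bins, remainder):
    """1-based bin numbers given one extra point: INT((n*k)/(r+1)), k = 1..r.

    Assumption: INT truncates toward zero, i.e. floor for positive values.
    """
    return [(n_bins * k) // (remainder + 1) for k in range(1, remainder + 1)]


def bin_counts(n_points, n_bins):
    """Number of points assigned to each bin, in bin order."""
    base, remainder = divmod(n_points, n_bins)
    counts = [base] * n_bins
    for bin_number in bins_receiving_extra(n_bins, remainder):
        counts[bin_number - 1] += 1
    return counts


# 23 points in 10 bins: base count 2, remainder 3, extras go to bins 2, 5 and 7.
print(bins_receiving_extra(10, 3))  # [2, 5, 7]
print(bin_counts(23, 10))           # [2, 3, 2, 2, 3, 2, 3, 2, 2, 2]
```

Under the truncation assumption the selected bin numbers are distinct and fall between 1 and n-1, so the extra points are spread across the range instead of being concentrated at one end.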

Each of the data distribution models 102 that are used in aggregation may have been created using one of several different methods. With reference to Fig. 3, a flow diagram illustrating a method performed by the computer program 54 for creating each of the one or more data distribution models 102 from each of the one or more data distributions 58 is shown. The data elements 56 are sorted from minimum to maximum, step 300. The number of data elements 56 in the data distribution 58 is computed, step 302. The value of the start point and the value of the end point of each bin 80 are determined by distributing, or dividing, the data elements 56 into a plurality of substantially equal sized bins 80 for each data distribution 58, step 304, as explained in more detail with respect to Fig. 4 below. The number of data elements 56 in each bin 80 is counted according to the following formula, step 306:

    start point < element value <= end point

wherein the start point is the start point of the respective bin 80, the element value is the value of each data element 56 in each bin, and the end point is the end point of the respective bin 80. The data distribution model 102 may thus be computed by setting, for each bin 80, the start point of the bin 80, the end point of the bin 80, and the number of data elements in the bin 80, step 308, for a linear model, or by adding the step of determining a polynomial formula for a data distribution model 102 that uses one for approximating the data elements 56 in each bin 80.

With reference to Fig. 4, a flow diagram illustrating the steps performed by the computer program 54 for determining the start points and end points of the bins 80 for each data distribution model 102 according to the method of Fig. 3 and system of Fig. 1 is shown. The start point of the first bin 80 of the data distribution model 102 is selected as the value of the data element 56 having the minimum value in the sorted data distribution 58, step 400. The start point and end point of each bin 80 in the data distribution model 102 are determined according to the following criteria, step 402:

(a) if the number of data elements 56 in the data distribution 58 is equally divisible by the number of bins 80, step 404, the end point of the first bin 80 is equal to the value of the ith data element 56 in the data distribution 58, wherein i is the number of data elements 56 in each bin determined by dividing the data elements 56 equally into the number of bins 80, wherein the value of the end point of each bin 80 is equal to the ith data element 56 after the last data element 56 in the preceding bin 80, wherein the start point of each bin 80 is equal to the data element 56 after the last data element 56 of the previous bin 80, step 406, else

(b) if the number of data elements 56 in the data distribution 58 is not equally divisible by the number of bins 80, then the number of data elements 56 in each bin 80 is determined by dividing the number of data elements 56 by the number of bins 80, and then adding one to the count of the data elements 56 in each of a number of bins 80 equal to the remainder after dividing, wherein the bins 80 that have one added to the count are determined according to the following formula:

    for k from 1 to r
        bin_add = INT((n*k)/(r+1))
    next k

wherein bin_add is the sequential bin number to add one to the count of data elements to include therein, n is the total number of bins 80 in the data distribution model, r is the remainder from dividing the number of data elements 56 in the data distribution by the number of bins 80 in the data distribution model, and INT is a function for rounding the result of the bracketed formula to produce an integer result.
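The model-building steps of Figs. 3 and 4 can be sketched for the evenly divisible case (a). The function name `build_model` and the dictionary layout are hypothetical. The sketch also assumes that each later bin starts at the end point of the preceding bin, which is how the table for Fig. 6 presents the start points.

```python
def build_model(values, n_bins):
    """Summarize raw data elements into n_bins equal-count bins.

    Sketch of case (a) only: the number of data elements must be
    evenly divisible by n_bins.
    """
    data = sorted(values)          # step 300: sort minimum to maximum
    total = len(data)              # step 302: number of data elements
    if total == 0 or total % n_bins != 0:
        raise ValueError("sketch covers only the evenly divisible case")
    per_bin = total // n_bins      # i, the number of elements per bin

    model = []
    start = data[0]                # step 400: minimum value starts bin 1
    for k in range(n_bins):
        # End point: value of the i-th element after the previous bin.
        end = data[(k + 1) * per_bin - 1]
        model.append({"start": start, "end": end, "count": per_bin})
        start = end                # assumed: next bin starts at this end point
    return model


model = build_model([5, 1, 4, 2, 3, 6], 3)
# Three bins of two elements each, with end points 2, 4 and 6.
```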

The computer program may perform the step of computing the number of data elements 56 in each bin 80 for the data distribution model 102 by counting, for each bin 80, each data element 56 satisfying the following formula:

    start point < element value <= end point

wherein the start point is the start point of the respective bin 80, the element value is the value of each data element 56 in each bin 80, and the end point is the end point of the respective bin 80.

A storage medium (70 in Fig. 1) may be provided for storing each data distribution model 102 by storing, for each bin 80, the start point, the end point, the number of data elements 56, and the parameters of the polynomial formula that best approximates the data elements 56 for the respective bin 80. Once the data distribution model 102 is stored, the original data distribution 58 from which the model was built no longer needs to be referred to. The computer program 54 may perform simple and complex statistical operations using the data model 102, or aggregations of two or more data models 102. For example, the computer program may determine the range of values of an aggregated data model 100 by subtracting the start point of the first bin 180 in the aggregated data model 100 from the end point of the last bin 180 in the aggregated data model 100, without having to refer to the original data elements 56 in the data distributions 58. The computer program 54 may further determine the median value of the aggregated data model 100 by determining a number j computed by dividing the number of bins in the aggregated data model by 2, and then reading the value of the end point of the jth bin as the median value if the number of bins 180 in the aggregated data model 100 is equally divisible by 2, or by reading the value of the mid point interpolated by the polynomial formula of the jth bin if the number of bins in the aggregated data model 100 is not equally divisible by 2.
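The range and median readings can be sketched as follows. The function names are hypothetical and each bin is given as a (start point, end point) pair. For an odd number of bins the sketch takes the middle bin, consistent with the specification's example of the 51st bin for 101 bins, and assumes a linear formula, so the interpolated mid point is the average of the bin's start and end points.

```python
def model_range(bins):
    """Range: end point of the last bin minus start point of the first bin."""
    return bins[-1][1] - bins[0][0]


def model_median(bins):
    """Median read from the bins without the original data elements."""
    n = len(bins)
    if n % 2 == 0:
        # Even number of bins: end point of bin n/2 (1-based).
        return bins[n // 2 - 1][1]
    # Odd number of bins: mid point of the middle bin.
    # Assumption: linear formula, so the mid point is the average.
    start, end = bins[n // 2]
    return (start + end) / 2


even_model = [(0, 2), (2, 5), (5, 9), (9, 20)]
odd_model = [(0, 2), (2, 6), (6, 20)]
print(model_range(even_model))   # 20
print(model_median(even_model))  # 5
print(model_median(odd_model))   # 4.0
```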

With reference to Fig. 5, an example of two data distributions 58 represented in histograms is shown, representing the result of a measurement of stroke time, or time to sweep the heads across the media, in magnetic or optical disk drives. One data distribution 58 is for product M, and one data distribution 58 is for product W. The bins 80 of the data distribution models 102, shown as bars, give a histogram of the disk drives with a specific time interval relative to the total population, shown as a fraction thereof, and each solid line 502 is a continuous approximation of the data distribution 58 of the disk drives by time. In both cases integration of the total area enclosed is equal to 1. Note that product M has more data elements 56 than product W.

With reference to Fig. 6, a graph of the data elements 56 divided into bins 80 from the data distribution 58 for product M from Fig. 5 is shown. The graph shows division of the curve into 10 bins (characterized by 11 end points). Information about the bins 80, including the number of data elements 56 in each bin 80 and parameters of the polynomial formula approximating the values of the data elements 56 within each bin 80, is stored in the storage medium 70 as shown in the table below.

Bin #   Start of Bin (c)   End of Bin    # of elements   Parameter 1 (a)   Parameter 2 (b)
1       8342 (min)                       2701            3.82E-6           5.4E-8
2       8503                             2705            1.76E-5           -6.1E-8
3       8540                             2701            1.28E-5           -1.07E-7
4       8569                             2674            6.5E-6            -1.26E-7
5       8596                             2732            -4.00E-7          -1.23E-7
6       8624                             2707            -7.50E-6          -9.88E-8
7       8655                             2640            -1.39E-5          -4.95E-8
8       8693                             2692            -1.25E-5          2.19E-8
9       9107                             2693            -4.42E-7          4.69E-9
10      9542               10420 (max)   2694            -2.23E-6          1.87E-9



The table above shows the start points of the bins 80, which are the end points of the preceding bins 80, the end point of the last bin 80, and the parameters associated with a polynomial (spline) formula found using a spline fit for each bin 80 of the data elements 56 approximated from a data distribution 58. For example, the eighth record in the table indicates that the quadratic formula found by the computer program 54 to have the best fit comprises:

    y = -0.0000125(x)^2 + 0.0000000219(x) + 8693

In order to derive each polynomial formula, the computer program 54 may use techniques to fit the best approximation of the data elements 56 in each bin 80 of the respective data distribution 58 such as the least squares method, spline fit, linear fit, or other methods known to those skilled in the art for approximating the curve formed by the data elements 56.

The above table shows the minimum, maximum and median values of the data distribution 58 directly. Fig. 7 shows the approximation of the original data distribution 58 from Fig. 6 using the quadratic and spline fit versus a linear fit of the data distribution 58. As can be seen, a quadratic and spline fit is preferred because it offers a better approximation of the original raw data elements 56 that are also shown in Fig. 7. Typically it is expected that 100 bins 80 should be used for better fits. When 100 bins are used, linear interpolation between the end points of a bin 80 can be used, requiring less storage space in the storage medium 70 due to the higher bin density.

Thus, the data elements 56 from the original data distribution 58 can be reduced to 4*n+1 points for cubic spline fits, 3*n+1 points for quadratic fits, or 2*n+1 points for linear fits, where n is the number of bins. Therefore the sample data can be reduced from, for example, 26,939 data elements to 31 points for 10 bins using a quadratic fit to the data in each bin 80, or to 201 points for 100 bins using straight linear interpolation (linear fit) between the end points of the bins 80.
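The storage arithmetic can be checked with a few lines. The function name `stored_values` and the fit labels are hypothetical; the multipliers are the ones stated in the text.

```python
def stored_values(n_bins, fit):
    """Number of stored values for a model with n_bins bins."""
    # Values kept per bin for each kind of fit, plus one final end point.
    per_bin = {"linear": 2, "quadratic": 3, "cubic_spline": 4}[fit]
    return per_bin * n_bins + 1


print(stored_values(10, "quadratic"))   # 31
print(stored_values(100, "linear"))     # 201
print(26939 // stored_values(10, "quadratic"))  # 869, the approximate reduction factor
```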

The minimum value, maximum value and range may be directly read from the distribution summary as the starting element 56 of the first bin 80, the ending element 56 of the last bin 80 and the difference between maximum and minimum.

The median is simply the value of the middle of the data distribution 58. It can be directly read by determining the value associated with the middle bin 80: for example, the value of the end point of the 5th bin for 10 bins, or the 50th bin for 100 bins, or the interpolated middle of the 51st bin for 101 bins using the polynomial representation for the 51st bin.

The interquartile range is important in various statistical analyses. It can be found by subtracting the value of the 25th percentile of the data, which is the value of the end of the 25th bin for n=100 bins, from the value of the 75th percentile, which is the value of the end of the 75th bin for n=100 bins.

Because the mean and standard deviation are technically only applicable to Gaussian, or normal, distributions, computation of these parameters may not be appropriate. If necessary, though, the standard method for computation is to assume the distribution is normal. In that case the median is equal to the mean and the standard deviation can be computed as the interquartile range divided by 1.349.

For determining outliers of the data distribution 58 the interquartile range (IQR) is used. Any data elements 56 greater than the value of the 75th percentile (i.e., the end point of bin 75) plus 1.5*IQR can typically be considered an outlier.
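The quartile-based quantities can be read from a 100-bin model as sketched below. The function name `quartile_summary` and the result keys are hypothetical; the input is assumed to be the list of the 100 bin end points in order.

```python
def quartile_summary(end_points):
    """IQR, normal-assumption sigma, and upper outlier limit from 100 bins."""
    if len(end_points) != 100:
        raise ValueError("sketch assumes a model with exactly 100 bins")
    q1 = end_points[24]   # end point of the 25th bin (25th percentile)
    q3 = end_points[74]   # end point of the 75th bin (75th percentile)
    iqr = q3 - q1
    return {
        "iqr": iqr,
        # Valid only if the distribution is assumed to be normal.
        "sigma_if_normal": iqr / 1.349,
        # Values above this limit can typically be considered outliers.
        "upper_outlier_limit": q3 + 1.5 * iqr,
    }


summary = quartile_summary(list(range(1, 101)))
# iqr is 50 and the upper outlier limit is 150.0 for end points 1..100.
```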

With reference to Fig. 8, a graphic illustration showing the approximation error if a data distribution 58 is treated as a normal distribution versus if the data distribution 58 is treated as a non-normal distribution using the system of Fig. 1 is shown. Location 900 indicates a plot of the standard deviation found with respect to the value of the data element 56 when plotting quantiles. In contrast, location 903 shows the same plot if a normal distribution is assumed. Other distributions can be evaluated for error as well, for example, Weibull, lognormal, Poisson, F, Chi-square, etc.

With reference to Fig. 9, an aggregated data model 100 aggregated from the data distribution models 102 of the data distributions 58 of Fig. 5 is shown. Linear interpolation and 100 bins are used to combine the data distribution models 102 from product M and product W as shown in Fig. 1. Two tables were constructed to summarize the distributions into 100 bins as shown in Appendix 1. The above described methods are then applied to determine the start point, end point, and number of points for each bin 180.

Appendix 1: Summary tables for Product W, Product M and combined aggregation

CLAIMS

WHAT IS CLAIMED IS:

1. A system for creating an aggregated data model from a plurality of data distribution models, each data distribution model describing a data distribution having one or more data elements, each data element having a value, each data distribution model having one or more bins, each bin comprising a start point having a value, an end point having a value, a value indicating the number of data elements for each bin, and a polynomial formula associated with each bin, the polynomial formula approximating the data elements for the respective bin, comprising:

    a processor; and

    a computer program executable on the processor, the computer program adapted to perform the following steps:

        (a) determining which start point has the minimum value and which end point has the maximum value of all of the bins of all of the data distribution models;

        (b) setting a start point of a first bin of the aggregated data model to said start point determined to have the minimum value;

        (c) setting an end point of a last bin of the aggregated data model to said end point determined to have the maximum value;

        (d) determining a total number of a plurality of points for the aggregated data model by adding the values indicating the number of data elements from all bins from all data distribution models;

        (e) approximating the data elements in the data distribution described by each data distribution model using the start point, polynomial formula, and number of data elements for each bin in each respective data distribution model, each approximated data element comprising one of said points in the aggregated data model;

        (f) sorting the points from minimum to maximum;

        (g) distributing the points into one or more bins in the aggregated data model such that a substantially equal number of points are in each bin of the aggregated data model; and

        (h) determining a polynomial formula with the sorted data elements for each bin of the aggregated data model.

2. The system of claim 1, wherein the computer program is further for determining the end point for each bin in the aggregated data model.

3. The system of claim 1, wherein the computer program is adapted to perform the step of distributing the points into the one or more bins of the aggregated data model according to the following formula:

    (a) if the number of points in the aggregated data model is equally divisible by the number of bins, the end point of the first bin is equal to the value of the ith point in the aggregated data model, wherein i is the number of points in each bin determined by dividing the points equally into the number of bins, wherein the value of the end point of each bin is equal to the ith point after the last point in the preceding bin, wherein the start point of each bin is equal to the point after the last point of the previous bin, else

    (b) if the number of points is not equally divisible by the number of bins, then the number of points in each bin is determined by dividing the number of points by the number of bins, and then adding one to the count of the points in each of a number of bins equal to the remainder after dividing, wherein the bins that have one added to the count are determined according to the following formula:

        for k from 1 to r
            bin_add = INT((n*k)/(r+1))
        next k

    wherein bin_add is the sequential bin number to add one to the count of points to include therein, n is the total number of bins in the aggregated data model, r is the remainder from dividing the number of points in the data distribution by the number of bins, and INT is a function for rounding the result of the bracketed formula to produce an integer result.

4. The system of claim 1, wherein the computer program is for performing, separately for each bin of the aggregated data model, the steps of approximating the data elements for each bin, determining the end point for each bin, and determining the polynomial formula for each bin.

5. The system of claim 1, wherein each data distribution model is the result of the computer
program performing the following steps:

(a) sorting the data elements from minimum to maximum for each data
distribution;

(b) computing the number of data elements in each data distribution;

(c) determining the value of the start point and the value of the end point of
each bin by dividing the data elements into a plurality of substantially
equal sized bins for each data distribution;

(d) counting the number of data elements in each bin for each data
distribution; and

(e) computing each distribution model for each data distribution, each
distribution model comprising, for each bin, the start point of the bin, the
end point of the bin, and the number of data elements in the bin.
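
Steps (a) through (e) can be sketched as follows. This is an illustrative outline only: the `Bin` record and the `build_model` function are hypothetical names, and the leftover elements are simply given to the leading bins, which is one of several ways to obtain "substantially equal sized" bins.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    start: float  # value of the first data element in the bin
    end: float    # value of the last data element in the bin
    count: int    # number of data elements in the bin


def build_model(values: list[float], num_bins: int) -> list[Bin]:
    data = sorted(values)                      # step (a): sort min to max
    total = len(data)                          # step (b): count the elements
    base, extra = divmod(total, num_bins)
    model: list[Bin] = []
    pos = 0
    for b in range(num_bins):
        # Simplifying assumption: the first `extra` bins take one more element.
        size = base + (1 if b < extra else 0)
        if size == 0:
            continue                           # fewer elements than bins
        chunk = data[pos:pos + size]           # step (c): bin boundaries
        model.append(Bin(chunk[0], chunk[-1], len(chunk)))  # steps (d), (e)
        pos += size
    return model


if __name__ == "__main__":
    for b in build_model([5, 1, 4, 2, 3, 6, 7], 3):
        print(b)
```

Each bin in the resulting model stores only its start point, end point and element count, so the storage cost depends on the number of bins rather than on the number of data elements.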

6. The system of claim 5, wherein the computer program is adapted to perform the
following steps for determining the start points and end points of the bins for each data
distribution model:

selecting as the start point of the first bin the value of the data element having the
minimum value in the sorted data distribution;

determining the start point and end point of each bin according to the following
criteria:

(c) if the number of data elements in the data distribution is equally divisible
into the number of bins, the end point of the first bin is equal to the value
of the ith data element in the data distribution, wherein i is the number of
data elements in each bin determined by dividing the data elements
equally into the number of bins, wherein the value of the end point of each
bin is equal to the value of the ith data element after the last data element
in the preceding bin, wherein the start point of each bin is equal to the data
element after the last data element of the previous bin, else

(d) if the number of data elements in the data distribution is not equally
divisible by the number of bins, then the number of data elements in each
bin is determined by dividing the number of data elements by the number
of bins, and then adding one to the count of the data elements in each of a
number of bins equal to the remainder after dividing, wherein the bins that
have one added to the count are determined according to the following
formula:

for k from 1 to r

bin_add = INT((n*k)/(r+1))

next k

wherein bin_add is the sequential bin number to add one to the count of
data elements to include therein, n is the total number of bins in the data
distribution model, r is the remainder from dividing the number of data
elements in the data distribution by the number of bins, and INT is a
function for rounding the result of the bracketed formula to produce an
integer result.

7. The system of claim 6, wherein the computer program is further for performing the step
of counting by counting, for each bin, each data element satisfying the following formula:

start point < element value <= end point

wherein start point is the start point of the respective bin, element value is the
value of each data element in each bin, and end point is the end point of the respective
bin.
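
The counting rule above can be expressed in a few lines of Python. The function name `count_in_bin` is hypothetical. The sketch applies the inequality literally, so an element whose value equals the start point is not counted in that bin, while an element equal to the end point is.

```python
def count_in_bin(values: list[float], start: float, end: float) -> int:
    """Count elements with start < value <= end (half-open on the left)."""
    return sum(1 for v in values if start < v <= end)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(count_in_bin(data, 2, 4))  # counts the elements 3 and 4
```

Using a half-open interval means that adjacent bins sharing a boundary value do not count the same element twice.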

8. The system of claim 7, comprising a storage medium for storing each data distribution
model by storing, for each bin, the start point, the end point, the number of data elements,
and the parameters of the polynomial formula.

9. The system of claim 1, wherein the computer program is further for performing one or
more statistical analyses using the aggregated data model.

10. The system of claim 9, wherein the statistical analysis performed comprises determining
the range of the points of the aggregated data model analyzed by subtracting the start point
of the first bin in the aggregated data model from the end point of the last bin in the
aggregated data model.

11. The system of claim 9, wherein the statistical analysis performed comprises determining
the inter-quantile range of the points of the aggregated data model.

12. The system of claim 9, wherein the statistical analysis performed comprises determining
the median value of the aggregated data model by determining a number j computed by
dividing the number of bins by 2, and then reading the value of the end point of the jth
bin as the median value if the number of bins in the aggregated data model is equally
divisible by 2, or by reading the interpolated value using the polynomial function of the
midpoint of the jth bin if the number of bins in the aggregated data model is not equally
divisible by 2.
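
A minimal sketch of the median rule follows. Several details are assumptions made for illustration and are not stated in the claim: each bin is represented as a dictionary with an `end` value and a `coeffs` list, the polynomial is expressed in a normalized position t running from 0 at the start of the bin to 1 at its end, and for an odd number of bins the non-integer quotient j is rounded up so that it designates the middle bin.

```python
import math


def poly_eval(coeffs: list[float], t: float) -> float:
    """Evaluate c0 + c1*t + c2*t**2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def model_median(bins: list[dict]) -> float:
    """Estimate the median from a binned model with equal-count bins."""
    n = len(bins)
    if n % 2 == 0:
        j = n // 2
        return bins[j - 1]["end"]          # end point of the jth bin
    j = math.ceil(n / 2)                   # assumption: middle bin
    # Assumption: t = 0.5 is the midpoint of the bin in normalized position.
    return poly_eval(bins[j - 1]["coeffs"], 0.5)


if __name__ == "__main__":
    even = [{"end": e, "coeffs": [0.0]} for e in (1.0, 2.0, 3.0, 4.0)]
    print(model_median(even))
```

The rule relies on every bin holding substantially the same number of points, so that the boundary between the two central bins, or the midpoint of the single central bin, splits the points roughly in half.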
13. A method for creating an aggregated data model from a plurality of data distribution models,

each data distribution model describing a data distribution having one or more data
elements, each data element having a value, each data distribution model having one or
more bins, each bin comprising a start point having a value, an end point having a value,
a value indicating the number of data elements for each bin, and a polynomial formula
associated with each bin, the polynomial formula approximating the data elements for the
respective bin, the method comprising:

determining which start point has the minimum value and which end point has the
maximum value of all of the bins of all of the data distribution models;

setting a start point of a first bin of the aggregated data model to said start point
determined to have the minimum value;

setting an end point of a last bin of the aggregated data model to said end point
determined to have the maximum value;

determining a total number of a plurality of points for the aggregated data model
by adding the values indicating the number of data elements from all bins from all data
distribution models;

approximating the data elements in the data distribution described by each data
distribution model using the start point, polynomial formula, and number of data
elements for each bin in each respective data distribution model, each approximated data
element comprising one of said points in the aggregated data model;

sorting the points from minimum to maximum;

distributing the points into one or more bins in the aggregated data model such
that a substantially equal number of points are in each bin of the aggregated data model;
and

determining a polynomial formula with the sorted data elements for each bin of
the aggregated data model.
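
The steps of the method can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: each bin's polynomial is taken to be first degree and expressed in a normalized position t in [0, 1], the approximated elements are generated at evenly spaced positions, leftover points go to the leading bins, and the names `Bin`, `approximate_points`, `fit_line` and `aggregate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    start: float
    end: float
    count: int
    coeffs: tuple[float, float]  # assumption: value ~ c0 + c1*t, t in [0, 1]


def approximate_points(b: Bin) -> list[float]:
    """Regenerate b.count approximate data elements from the bin polynomial."""
    c0, c1 = b.coeffs
    if b.count == 1:
        return [c0]
    return [c0 + c1 * i / (b.count - 1) for i in range(b.count)]


def fit_line(points: list[float]) -> tuple[float, float]:
    """Least-squares line through (t, value), t evenly spaced in [0, 1]."""
    m = len(points)
    if m == 1:
        return (points[0], 0.0)
    ts = [i / (m - 1) for i in range(m)]
    t_mean, v_mean = sum(ts) / m, sum(points) / m
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, points))
    c1 = sxy / sxx
    return (v_mean - c1 * t_mean, c1)


def aggregate(models: list[list[Bin]], num_bins: int) -> list[Bin]:
    """Combine several models; assumes they hold at least one point in total."""
    all_bins = [b for model in models for b in model]
    lo = min(b.start for b in all_bins)        # minimum start point
    hi = max(b.end for b in all_bins)          # maximum end point
    total = sum(b.count for b in all_bins)     # total number of points
    points = sorted(p for b in all_bins for p in approximate_points(b))
    base, extra = divmod(total, num_bins)
    out: list[Bin] = []
    pos = 0
    for k in range(num_bins):
        size = base + (1 if k < extra else 0)  # substantially equal counts
        if size == 0:
            continue
        chunk = points[pos:pos + size]
        out.append(Bin(chunk[0], chunk[-1], size, fit_line(chunk)))
        pos += size
    out[0].start = lo                          # first bin starts at the minimum
    out[-1].end = hi                           # last bin ends at the maximum
    return out


if __name__ == "__main__":
    a = [Bin(0.0, 4.0, 5, (0.0, 4.0))]
    b = [Bin(10.0, 13.0, 4, (10.0, 3.0))]
    for agg_bin in aggregate([a, b], 3):
        print(agg_bin)
```

The aggregation works only from the stored bin parameters, so the original data elements are never needed again once each data distribution model has been built.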

14. The method of claim 13, comprising determining the end point for each bin in the
aggregated data model.

15. The method of claim 13, wherein the step of distributing the points into the one or more
bins of the aggregated data model is performed according to the following criteria:

(e) if the number of points in the aggregated data model is equally divisible
into the number of bins, the end point of the first bin is equal to the value
of the ith point in the aggregated data model, wherein i is the number of
points in each bin determined by dividing the points equally into the
number of bins, wherein the value of the end point of each bin is equal to
the value of the ith point after the last point in the preceding bin, wherein
the start point of each bin is equal to the point after the last point of the
previous bin, else

(f) if the number of points is not equally divisible by the number of bins,
then the number of points in each bin is determined by dividing the number
of points by the number of bins, and then adding one to the count of the
points in each of a number of bins equal to the remainder after dividing,
wherein the bins that have one added to the count are determined according
to the following formula:

for k from 1 to r

bin_add = INT((n*k)/(r+1))

next k

wherein bin_add is the sequential bin number to add one to the count of
points to include therein, n is the total number of bins in the aggregated
data model, r is the remainder from dividing the number of points in
the data distribution by the number of bins, and INT is a function for
rounding the result of the bracketed formula to produce an integer result.

16. The method of claim 13, wherein the steps of approximating the data elements for each
bin, determining the end point for each bin, and determining the polynomial formula for
each bin are performed separately for each bin of the aggregated data model.

17. The method of claim 13, comprising creating each data distribution model using the
following steps:

(f) sorting the data elements from minimum to maximum for each data
distribution;

(g) computing the number of data elements in each data distribution;

(h) determining the value of the start point and the value of the end point of
each bin by dividing the data elements into a plurality of substantially
equal sized bins for each data distribution;

(i) counting the number of data elements in each bin for each data
distribution; and

(j) computing each distribution model for each data distribution, each
distribution model comprising, for each bin, the start point of the bin, the
end point of the bin, and the number of data elements in the bin.

18. The method of claim 17, comprising determining the start points and end points of the
bins for each data distribution model using the following steps:

selecting as the start point of the first bin the value of the data element having the
minimum value in the sorted data distribution;

determining the start point and end point of each bin according to the following
criteria:

(g) if the number of data elements in the data distribution is equally divisible
into the number of bins, the end point of the first bin is equal to the value
of the ith data element in the data distribution, wherein i is the number of
data elements in each bin determined by dividing the data elements
equally into the number of bins, wherein the value of the end point of each
bin is equal to the value of the ith data element after the last data element
in the preceding bin, wherein the start point of each bin is equal to the data
element after the last data element of the previous bin, else

(h) if the number of data elements in the data distribution is not equally
divisible by the number of bins, then the number of data elements in each
bin is determined by dividing the number of data elements by the number
of bins, and then adding one to the count of the data elements in each of a
number of bins equal to the remainder after dividing, wherein the bins that
have one added to the count are determined according to the following
formula:

for k from 1 to r

bin_add = INT((n*k)/(r+1))

next k

wherein bin_add is the sequential bin number to add one to the count of
data elements to include therein, n is the total number of bins in the data
distribution model, r is the remainder from dividing the number of data
elements in the data distribution by the number of bins, and INT is a
function for rounding the result of the bracketed formula to produce an
integer result.

19. The method of claim 18, wherein the step of counting is performed by counting, for each
bin, each data element satisfying the following formula:

start point < element value <= end point

wherein start point is the start point of the respective bin, element value is the
value of each data element in each bin, and end point is the end point of the respective
bin.

20. The method of claim 19, comprising storing each data distribution model by storing, for
each bin, the start point, the end point, the number of data elements, and the parameters of
the polynomial formula.

21. The method of claim 13, comprising performing one or more statistical analyses using the
aggregated data model.

22. The method of claim 21, wherein the statistical analysis performed comprises
determining the range of the points of the aggregated data model analyzed by subtracting
the start point of the first bin in the aggregated data model from the end point of the last
bin in the aggregated data model.

23. The method of claim 21, wherein the statistical analysis performed comprises
determining the inter-quantile range of the points of the aggregated data model.

24. The method of claim 21, wherein the statistical analysis performed comprises
determining the median value of the aggregated data model by determining a number j
computed by dividing the number of bins by 2, and then reading the value of the end
point of the jth bin as the median value if the number of bins in the aggregated data
model is equally divisible by 2, or by reading the interpolated value using the polynomial
function of the midpoint of the jth bin if the number of bins in the aggregated data model
is not equally divisible by 2.

METHOD AND SYSTEM FOR AGGREGATING DATA DISTRIBUTION MODELS

ABSTRACT

A system and method for creating an aggregated data model from a plurality of data
distribution models having bins approximating data elements in a plurality of data distributions is
disclosed. Each bin of each data distribution model has a polynomial formula for approximating
data elements in a respective data distribution. A method for creating the aggregated data model
comprises: determining a start point of the aggregated data model having the minimum value and
an end point having a maximum value of all of the bins of all of the data distribution models;
setting a start point of a first bin of the aggregated data model; setting an end point of a last bin
of the aggregated data model; determining a total number of points for the aggregated data
model; approximating the data elements described by each data distribution model, each
approximated data element comprising one point in the aggregated data model; sorting the points
from minimum to maximum; distributing the points into one or more bins in the aggregated data
model; and determining a polynomial formula for the points for each bin of the aggregated data
model.